# Required Libraries
library(car)

Question #1: Multiple Linear Regression on New York City Restaurants

Load the 04NYCRestaurants.txt dataset into your workspace. This dataset contains survey results from customers of 168 different Italian restaurants in the New York City area. The data are in the form of the average of customer views on various attributes (food, decor, and service) scored on a scale from 1 to 30, along with the average price of dinner. There is also a categorical variable for the location of the restaurant.

1.1: Create a scatterplot matrix of all continuous variables colored by Location. From this plot alone, do you see any problems that might arise for multiple linear regression?
rest <- read.table('04NYCRestaurants.txt', header = TRUE, sep = " ", quote = "\"", stringsAsFactors = FALSE)

rest$Location <- as.factor(rest$Location)

plot(rest[, 2:5], col = rest$Location)

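Looking at the scatterplot matrix, many of the variables appear to be correlated with each other. Correlation among the predictors themselves would point to multicollinearity, which can be a problem for multiple linear regression.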
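
A correlation matrix can put numbers on what the scatterplot matrix shows. The sketch below is only an illustration: it uses simulated stand-in data (not the survey results), and the column names and the 0.5 cutoff are assumptions chosen for the example.

```r
# Hypothetical stand-in data: the column names mirror the assignment's
# variables, but the values are simulated, not the survey results.
set.seed(1)
n <- 168
food    <- rnorm(n, mean = 20, sd = 2)
decor   <- rnorm(n, mean = 18, sd = 3)
service <- 0.6 * food + 0.3 * decor + rnorm(n, sd = 1)  # built to track food and decor
sim <- data.frame(Food = food, Decor = decor, Service = service)

# Pairwise Pearson correlations among the predictors
pred_cor <- cor(sim)
round(pred_cor, 2)

# List predictor pairs whose absolute correlation exceeds a chosen cutoff
cutoff <- 0.5  # arbitrary illustrative threshold
high <- which(abs(pred_cor) > cutoff & upper.tri(pred_cor), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(pred_cor)[high[, "row"]],
  var2 = colnames(pred_cor)[high[, "col"]],
  r    = pred_cor[high]
)
high_pairs
```

With the real data, the same idea would presumably be applied to the continuous columns of rest in place of sim.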

1.2: Fit a multiple linear regression predicting the price of a meal based on the customer views and location of the restaurant. For this model:
  1. Write out the regression equation. Price = -21.956 + 1.538(Food) + 1.910(Decor) - 0.003(Service) - 2.068(LocationWest)

  2. Interpret the meaning of each of the 5 coefficients in the context of the problem. Intercept coefficient - if food, decor, and service were all rated 0 and the location were East, the model would predict an average price of -21.956. This doesn't make sense in the context of the problem (ratings are scored from 1 to 30, and a price cannot be negative); it is just the fixed point where the line is anchored.

    Food coefficient - holding all else constant, an increase of 1 in the food rating is associated with an average increase in price of 1.54.

    Decor coefficient - holding all else constant, an increase of 1 in the decor rating is associated with an average increase in price of 1.91.

    Service coefficient - holding all else constant, an increase of 1 in the service rating is associated with an average decrease in price of 0.003.

    LocationWest coefficient - holding all else constant, a restaurant in the West is on average 2.07 cheaper than a restaurant in the East.

  3. Are the coefficients significant? How can you tell?

Based on the t-tests in the multiple linear regression summary, the Intercept, Food, Decor, and LocationWest coefficients are statistically significant (p-value less than 0.05). The Service coefficient is not (p-value = 0.9945).

  4. Is the overall regression significant? How can you tell?

The overall model is significant: the F-test (F = 68.76 on 4 and 163 degrees of freedom) has a p-value less than 0.05.

  5. Find and interpret the RSE.

The RSE is 5.738. This is the estimated standard deviation of the residual errors, so predicted prices typically differ from actual prices by about 5.74.

  6. Find and interpret the adjusted coefficient of determination.

The adjusted coefficient of determination is 0.6187, which means that roughly 62% of the variation in price is explained by the variables in the model, after adjusting for the number of predictors.
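
To make the coefficient interpretations concrete, the regression equation can be evaluated by hand. This is a small sketch using the rounded coefficients from the equation above; the ratings (20, 18, 19) and the helper name predict_price are made up for illustration.

```r
# Rounded coefficients copied from the regression equation above
b <- c(Intercept = -21.956, Food = 1.538, Decor = 1.910,
       Service = -0.003, LocationWest = -2.068)

# Hypothetical helper: plug ratings and location into the fitted equation
predict_price <- function(food, decor, service, west) {
  unname(b["Intercept"] + b["Food"] * food + b["Decor"] * decor +
    b["Service"] * service + b["LocationWest"] * as.numeric(west))
}

# Hypothetical ratings, chosen only for illustration
east <- predict_price(food = 20, decor = 18, service = 19, west = FALSE)
west <- predict_price(food = 20, decor = 18, service = 19, west = TRUE)

# The two predictions differ by exactly the LocationWest coefficient
c(east = east, west = west, difference = west - east)
```

For these ratings the East prediction is about 43.13, and the West prediction is lower by 2.068, which is exactly the LocationWest coefficient.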

restModFull <- lm(Price ~ . -Restaurant, data = rest)
summary(restModFull)
## 
## Call:
## lm(formula = Price ~ . - Restaurant, data = rest)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.0465  -3.8837   0.0373   3.3942  17.7491 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -21.955750   4.857969  -4.520 1.19e-05 ***
## Food           1.538120   0.368951   4.169 4.96e-05 ***
## Decor          1.910087   0.217005   8.802 1.87e-15 ***
## Service       -0.002727   0.396232  -0.007   0.9945    
## LocationWest  -2.068050   0.946739  -2.184   0.0304 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.738 on 163 degrees of freedom
## Multiple R-squared:  0.6279, Adjusted R-squared:  0.6187 
## F-statistic: 68.76 on 4 and 163 DF,  p-value: < 2.2e-16
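
As a sanity check, the adjusted R-squared and the overall F statistic can be recomputed from the printed Multiple R-squared. This sketch only uses numbers read from the output above, so small differences from rounding are expected.

```r
# Values read from the summary() output above
r2 <- 0.6279   # Multiple R-squared
n  <- 168      # restaurants
p  <- 4        # estimated slopes: Food, Decor, Service, LocationWest

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic and its p-value on (p, n - p - 1) degrees of freedom
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
p_val  <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

round(c(adj_r2 = adj_r2, f_stat = f_stat), 4)
p_val
```
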
1.3: Investigate the assumptions of the model using the plot() function. Are there any violations?
plot(restModFull)

The Q-Q plot appears to show a violation of the normality assumption. The scale-location plot also appears to show a violation of the constant variance (homoscedasticity) assumption, which is what that plot assesses.
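
The visual checks can be backed up with formal tests. The sketch below is one possible approach, run on simulated data (not the restaurant data) whose error variance grows with x by construction. It uses the Shapiro-Wilk test for normality and a studentized Breusch-Pagan statistic computed by hand in base R; neither test is part of the original assignment.

```r
# Simulated data with error variance that grows with x (not the restaurant data)
set.seed(42)
n <- 168
x <- runif(n, min = 1, max = 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Normality of residuals: Shapiro-Wilk test (H0: residuals are normal)
sw <- shapiro.test(residuals(fit))

# Constant variance: studentized Breusch-Pagan test computed by hand.
# Regress squared residuals on the predictors; n * R^2 is approximately
# chi-squared with df equal to the number of predictors under H0.
e2      <- residuals(fit)^2
aux     <- lm(e2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(shapiro_p = sw$p.value, bp_stat = bp_stat, bp_p = bp_p)
```

Small p-values would count as evidence against normality and constant variance respectively. Independence of the errors is a separate assumption that these two checks do not address.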

1.4: Investigate the influence plot for the model. Are there any restaurants about which we should be concerned?
influencePlot(restModFull)

##       StudRes        Hat      CookD
## 56  3.2666518 0.05010858 0.32600253
## 130 2.9463084 0.07181092 0.35815562
## 168 0.4012884 0.21011533 0.09279813

There are a few restaurants of concern. Restaurants 56 and 130 have large studentized residuals (about 3.3 and 2.9) but comparatively little leverage. Restaurant 168 has high leverage (hat value of about 0.21) but a small residual.
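
One way to make "high" concrete is to compare the printed values against rule-of-thumb cutoffs. The cutoffs below (twice the average hat value, and an absolute studentized residual above 2) are common conventions assumed for this sketch, not hard rules.

```r
# Values copied from the influencePlot() output above
infl <- data.frame(
  obs     = c(56, 130, 168),
  StudRes = c(3.2666518, 2.9463084, 0.4012884),
  Hat     = c(0.05010858, 0.07181092, 0.21011533)
)

n <- 168  # observations
k <- 5    # estimated coefficients, including the intercept

# Rule-of-thumb cutoffs (conventions, not hard rules)
hat_cutoff <- 2 * k / n   # twice the average hat value k / n
res_cutoff <- 2           # |studentized residual| above about 2

infl$large_residual <- abs(infl$StudRes) > res_cutoff
infl$high_leverage  <- infl$Hat > hat_cutoff
infl
```

Under these particular cutoffs, restaurant 130 sits slightly above the leverage cutoff (0.072 versus about 0.060), while restaurant 168 is far above it.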

1.5: Investigate the coefficient variance inflation factors; use these values to discuss multicollinearity.
vif(restModFull)
##     Food    Decor  Service Location 
## 2.714261 1.744851 3.558735 1.064985

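The VIF for Service is the highest (3.56), which is consistent with it having the least significant coefficient: much of the information in Service is already carried by the other predictors, which inflates its standard error. All of the VIFs are below 5, a common rule-of-thumb cutoff, so the multicollinearity is moderate rather than severe.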
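
A VIF can be translated back into something easier to read. Since VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on all the others, the printed values imply the following (a sketch using only the numbers from the output above):

```r
# VIFs copied from the vif() output above
vifs <- c(Food = 2.714261, Decor = 1.744851, Service = 3.558735, Location = 1.064985)

# VIF_j = 1 / (1 - R_j^2), so R_j^2 = 1 - 1 / VIF_j, where R_j^2 comes from
# regressing predictor j on the remaining predictors
r2_other <- 1 - 1 / vifs

# sqrt(VIF) is the factor by which the coefficient's standard error is inflated
# relative to a predictor that is uncorrelated with the others
se_inflation <- sqrt(vifs)

round(rbind(r2_other, se_inflation), 3)
```

For Service this works out to roughly 72% of its variation being explained by the other predictors.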

1.6: Create added variable plots for this model. What conclusions might you draw from these plots?
avPlots(restModFull)

The added variable plots show that Food and Decor are the most powerful predictors in the model (as expected based on coefficient p-values). While location has some impact it is relatively small. The service variable appears to be the least helpful in predicting price.
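
For reference, an added variable plot can be built by hand: regress the response on the other predictors, regress the predictor of interest on the other predictors, and plot the two sets of residuals against each other. The sketch below does this on simulated data (not the restaurant data) and confirms that the slope of that line matches the multiple regression coefficient.

```r
# Simulated data (not the restaurant data) with two correlated predictors
set.seed(7)
n  <- 168
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.5)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)
full <- lm(y ~ x1 + x2)

# Added-variable construction for x2: remove the part of y and of x2
# that is explained by the other predictor, x1
y_res  <- residuals(lm(y ~ x1))
x2_res <- residuals(lm(x2 ~ x1))
av_fit <- lm(y_res ~ x2_res)

# To draw the panel: plot(x2_res, y_res); abline(av_fit, lty = 2)

# The slope of the added-variable line equals the multiple-regression coefficient
c(av_slope = coef(av_fit)[["x2_res"]], full_coef = coef(full)[["x2"]])
```

This is why a nearly flat line in an added variable panel corresponds to a coefficient near zero in the full model.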

1.7: Fit a new simple linear regression that predicts the price of dinner from the service rating alone. Discuss this regression in light of your answer to part 6.
restModSvc <- lm(Price ~ Service, data = rest)
summary(restModSvc)
## 
## Call:
## lm(formula = Price ~ Service, data = rest)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6646  -4.7540  -0.2093   4.3368  26.2460 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11.9778     5.1093  -2.344   0.0202 *  
## Service       2.8184     0.2618  10.764   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.153 on 166 degrees of freedom
## Multiple R-squared:  0.4111, Adjusted R-squared:  0.4075 
## F-statistic: 115.9 on 1 and 166 DF,  p-value: < 2.2e-16
plot(Price ~ Service, data = rest)
abline(restModSvc, lty = 2)

The simple linear regression predicting price solely from the service rating is statistically significant (p-value < 0.05). According to this model, an increase of 1 in the service rating is associated with an average increase in price of 2.82. However, the added variable plots and the full model show that Service adds almost nothing once Food and Decor are accounted for: the service rating is correlated with the quality of the food and decor, so on its own it acts as a proxy for them. A better model would therefore predict price from food, decor, and location.
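
The same pattern can be reproduced in a small simulation. In the sketch below everything is made up: service is constructed to track food and decor, and price depends only on food and decor. The sample size and all coefficients are assumptions for illustration.

```r
# Simulated data, not the survey: service is built to track food and decor,
# and price depends only on food and decor
set.seed(123)
n <- 500
food    <- rnorm(n, mean = 20, sd = 2)
decor   <- rnorm(n, mean = 18, sd = 3)
service <- 0.6 * food + 0.3 * decor + rnorm(n, sd = 1)
price   <- -22 + 1.5 * food + 1.9 * decor + rnorm(n, sd = 5.7)

# Slope of service alone vs. slope of service after adjusting for food and decor
marginal <- coef(lm(price ~ service))[["service"]]
partial  <- coef(lm(price ~ food + decor + service))[["service"]]
c(marginal = marginal, partial = partial)
```

The slope on service by itself is large because it absorbs the effect of food and decor, while the slope after adjusting for them stays near zero.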

Question #2: Model Selection on New York City Restaurants

2.1: Regress the price of dinner onto the average customer food rating, decor rating, and the restaurant location. In context of this new model, comment on:
  1. The model summary() output. This model appears to be better than the full model. All coefficients are now statistically significant. The overall F-test is still significant (p-value less than 0.05). The RSE decreases slightly to 5.72 and the adjusted coefficient of determination increases slightly to 0.6211.

  2. The assumptions of multiple linear regression. As with the full model, the Q-Q plot appears to show a violation of the normality assumption. In addition, the scale-location plot also appears to show a violation of the constant variance (homoscedasticity) assumption.

  3. The influence plot of the model. 56 and 130 are still outliers in this model. 117 has relatively high leverage but a low residual error.

  4. The variance inflation factors of the coefficients. Removing the Service variable reduced the VIFs of the remaining coefficients (Food from 2.71 to 1.39, Decor from 1.74 to 1.35, and Location from 1.06 to 1.04).

  5. The added variable plots for the model. The added variable plots show that Food and Decor are the most powerful predictors in the model (as expected based on coefficient p-values). While location has some impact it is relatively small.

restModNew <- lm(Price ~ . -Restaurant -Service, data = rest)
summary(restModNew)
## 
## Call:
## lm(formula = Price ~ . - Restaurant - Service, data = rest)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.0451  -3.8809   0.0389   3.3918  17.7557 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -21.9599     4.8063  -4.569 9.59e-06 ***
## Food           1.5363     0.2632   5.838 2.76e-08 ***
## Decor          1.9094     0.1900  10.049  < 2e-16 ***
## LocationWest  -2.0670     0.9318  -2.218   0.0279 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.72 on 164 degrees of freedom
## Multiple R-squared:  0.6279, Adjusted R-squared:  0.6211 
## F-statistic: 92.24 on 3 and 164 DF,  p-value: < 2.2e-16
plot(restModNew)

influencePlot(restModNew)

##       StudRes        Hat     CookD
## 56  3.2282969 0.02245111 0.2378828
## 117 0.4445866 0.18087990 0.1047159
## 130 2.9380865 0.06091881 0.3657472
vif(restModNew)
##     Food    Decor Location 
## 1.389515 1.346030 1.038000
avPlots(restModNew)

2.2: Run a partial F-test to compare this model with the overall model you created in question 1. Interpret your results.
anova(restModNew, restModFull)
## Analysis of Variance Table
## 
## Model 1: Price ~ (Restaurant + Food + Decor + Service + Location) - Restaurant - 
##     Service
## Model 2: Price ~ (Restaurant + Food + Decor + Service + Location) - Restaurant
##   Res.Df    RSS Df Sum of Sq  F Pr(>F)
## 1    164 5366.5                       
## 2    163 5366.5  1   0.00156  0 0.9945

Given that the p-value (0.9945) is greater than 0.05, we cannot reject the null hypothesis that the Service coefficient is zero. The simpler model that excludes the Service variable is therefore preferred.
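
The partial F statistic can be recomputed from the table as a check. This sketch uses only the rounded values printed above, so it will match the output only approximately. The t statistic for Service is taken from the summary of the full model in question 1.

```r
# Values copied from the anova() table above
rss_full <- 5366.5    # RSS of the model that includes Service
sum_sq   <- 0.00156   # drop in RSS from adding Service
df_full  <- 163       # residual df of the full model
q        <- 1         # number of coefficients being tested

# Partial F: (drop in RSS per tested coefficient) / (full-model mean squared error)
f_stat <- (sum_sq / q) / (rss_full / df_full)
p_val  <- pf(f_stat, df1 = q, df2 = df_full, lower.tail = FALSE)

# With a single coefficient tested, F equals the squared t statistic
t_service <- -0.002727 / 0.396232  # estimate / std. error from summary(restModFull)
c(f_stat = f_stat, t_squared = t_service^2, p_val = p_val)
```

Because only one coefficient is being tested, this partial F-test gives the same p-value as the t-test for Service in the full model.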

2.3: Fit a new reduced model that predicts the price of dinner by only the average customer food rating and average customer decor rating. Briefly comment on the model assumptions.
restModFD <- lm(Price ~ Food + Decor, data = rest)
summary(restModFD)
## 
## Call:
## lm(formula = Price ~ Food + Decor, data = rest)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.945  -3.766  -0.153   3.701  18.757 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -24.5002     4.7230  -5.187 6.19e-07 ***
## Food          1.6461     0.2615   6.294 2.68e-09 ***
## Decor         1.8820     0.1919   9.810  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.788 on 165 degrees of freedom
## Multiple R-squared:  0.6167, Adjusted R-squared:  0.6121 
## F-statistic: 132.7 on 2 and 165 DF,  p-value: < 2.2e-16
plot(restModFD)

By removing the location variable we see a slight increase in RSE to 5.788 and a slight decrease in the R^2 value to 0.6121. As with the previous two models, we still see that the QQ plot appears to show a violation of the normality assumption. In addition, the scale location plot also appears to show a violation of the independent errors assumption.

2.4: Compare each of the following models based on AIC:
  1. The overall model fitted in question 1.
  2. The overall model without the service variable fitted in question 2 part 1.
  3. The reduced model fitted in question 2 part 3.
AIC(restModFull, restModNew, restModFD)
##             df      AIC
## restModFull  6 1070.711
## restModNew   5 1068.711
## restModFD    4 1071.677

Based on the AIC, the model that includes food, decor and location is the best model (with the lowest AIC). The worst model is the one that only includes food and decor.
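
For a linear model with normal errors, AIC can be written in terms of the residual sum of squares. The sketch below is a check under that assumption; the helper name aic_from_rss is hypothetical, and the RSS is the rounded value from the partial F-test table, so agreement with AIC() is only to about two decimal places.

```r
# AIC for a Gaussian linear model, written in terms of the residual sum of squares.
# k counts every estimated parameter: the coefficients plus the error variance.
aic_from_rss <- function(rss, n, k) {
  n * log(rss / n) + n * (1 + log(2 * pi)) + 2 * k
}

n   <- 168
rss <- 5366.5  # rounded RSS shared by restModFull and restModNew in the anova() table

aic_full <- aic_from_rss(rss, n, k = 6)
aic_new  <- aic_from_rss(rss, n, k = 5)
c(restModFull = aic_full, restModNew = aic_new)
```

Since the two models have essentially the same RSS, the gap of 2 between their AIC values comes entirely from the penalty for the extra Service coefficient.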

2.5: Compare each of the models based on BIC.
BIC(restModFull, restModNew, restModFD)
##             df      BIC
## restModFull  6 1089.454
## restModNew   5 1084.330
## restModFD    4 1084.173

Based on the BIC, the model that includes just food and decor is the best model (with the lowest BIC), though only slightly better than the model which also includes location. The worst model is the one that includes all variables.
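
AIC and BIC share the same log-likelihood term and differ only in the penalty per parameter (2 for AIC, log(n) for BIC). As a sketch, the BIC values can be recovered from the printed AIC values and parameter counts:

```r
# AIC values and parameter counts copied from the AIC() output
aic <- c(restModFull = 1070.711, restModNew = 1068.711, restModFD = 1071.677)
k   <- c(restModFull = 6, restModNew = 5, restModFD = 4)
n   <- 168

# Both criteria share the same -2 * logLik term; only the penalty per parameter differs
bic <- aic + k * (log(n) - 2)
round(bic, 3)

# Preferred model under each criterion
names(which.min(aic))
names(which.min(bic))
```

With n = 168 the BIC penalty is about 5.12 per parameter, which is enough to tip the choice toward the smaller food and decor model.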

2.6: Do you expect to see the results from part 4 and part 5? Which model would you ultimately choose to use?

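The results from parts 4 and 5 were expected. BIC penalizes each additional parameter more heavily than AIC (log(168) is about 5.12, versus 2), so it tends to favor smaller models; here it slightly prefers the food and decor model, while AIC prefers the model that also includes location. I would ultimately choose the model that includes food, decor, and location: the Location coefficient is statistically significant (p-value = 0.0279), and the BIC difference between the two models is very small.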